ABSTRACT

Various  types  of  processor  synchronization  are  introduced  and  analyzed  with 
regard  to  execution  time  and  waiting  time.  It  is  shown  that  while  barrier 
synchronization  requires  a  large  execution  time  on  systems  with  many  processors, 
other  less  restrictive  forms  of  synchronization  do  not  have  this  drawback.  It  is  also 
shown  that  for  most  reasonable  distributions  of  processor  times,  a  relatively  small 
amount  of  time  is  spent  waiting  at  synchronization  points.  The  conclusion  is  that  if 
barriers  can  be  replaced  by  the  less  restrictive  synchronization  forms,  then,  for 
problems  with  appropriate  size  granularity,  the  synchronization  costs  on  most 
multiprocessors  will  be  small. 

1.   Introduction 

Considerable  attention  has  been  given  to  the  problem  of  synchronizing 
multiprocessors.  In  particular,  barrier  synchronization  has  been  analyzed  [1]  and  shown  to 
require  significant  execution  time  on  systems  with  large  numbers  of  processors.  To  avoid 
the  extra  cost  of  synchronizing  processors,  asynchronous  algorithms  have  been  developed 
[2,3].  In  this  paper  we  consider  several  forms  of  synchronization  and  estimate  the  time 
required  in  execution  and  in  waiting  at  the  synchronization  points.  The  goal  is  to 
determine  what  the  costs  are  for  the  various  forms  of  synchronization  and  thereby  to 
determine  which  form  is  best  (when  several  different  forms  could  be  used)  and  under  what 
circumstances  it  is  worthwhile  to  pursue  asynchronous  algorithms. 

When  an  algorithm  is  implemented  on  several  processors,  there  are  usually  points  in 
the  algorithm  at  which  some  processor  must  wait  for  data  to  be  computed  by  other 
processors,  before  it  can  proceed  with  its  own  work.  At  such  points,  some  type  of 
processor  synchronization  is  required.  The  processors  computing  the  needed  data  must 
"announce"  that  it  is  ready,  and  the  processor  that  needs  this  data  must  "check"  to  see  if  it 
is  ready  and  wait  until  the  check  is  affirmative  before  proceeding. 

Often  barrier  synchronization  is  used.  A  barrier  is  a  point  in  a  program  at  which  all 
processors  must  arrive  before  any  can  proceed.  While  barrier  synchronization  is 
sometimes  necessary,  it  can  often  be  replaced  by  other  less  restrictive  forms  of 
synchronization. 

Consider,  for  example,  the  type  of  synchronization  required  in  various  iterative 
methods  for  solving  sparse  linear  systems.  If  the  linear  system  is  denoted  by  Ax  =  b,  then 
most  iterative  methods  can  be  expressed  in  two  stages: 


Given an initial guess $x^0$, compute $r^0 = b - Ax^0$, and for $m = 1, 2, \ldots$,

(1) Generate an approximate solution $x^m$.

(2) Compute a residual $r^m = b - Ax^m$.

Stage  (1)  may  itself  consist  of  several  stages  that  require  different  types  of  processor 
synchronization,  but  consider  the  type  of  synchronization  required  between  stages  (1)  and 
(2).  Assume  that  the  matrix  A  arises  from  differencing  a  partial  differential  equation  on 
some  grid,  as,  for  example  that  pictured  in  Fig.  1,  and  that  parallelism  has  been  achieved 
by  assigning  different  sections  of  the  grid  to  different  processors. 

While a barrier could be used between stages (1) and (2) - forcing all processors to
complete their sections of $x^m$ before any can begin work on $r^m$ - it is not necessary. Since
the matrix A couples only neighboring points, if the grid is divided as in Fig. 1, then any
processor can compute its section of $r^m$ as soon as it and its neighboring processor(s) have
computed their sections of $x^m$. Thus, instead of a barrier one could use neighbor
synchronization  -  forcing  each  processor  to  complete  its  section  of  stage  (1)  and  wait  for  its 
neighboring  processor(s)  to  complete  their  sections  of  stage  (1),  before  proceeding  to  stage 
(2). 
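
The announce/check pattern behind neighbor synchronization can be sketched with threads. This is only an illustration under stated assumptions: the function name `run_neighbor_sync`, the `done` event table, the linear chain of processors, and the stand-in arithmetic for stages (1) and (2) are all hypothetical and not taken from the paper. The iterate is stored per stage so that a fast processor cannot overwrite values a slower neighbor still needs.

```python
import threading

def run_neighbor_sync(num_procs, num_stages):
    """Two-stage iteration on a chain of processors with neighbor syncs."""
    # x[m][i]: section of the iterate owned by processor i at stage m.
    x = [[None] * num_procs for _ in range(num_stages)]
    r = [[None] * num_procs for _ in range(num_stages)]
    # done[m][i] is set once processor i has finished stage (1) of step m.
    done = [[threading.Event() for _ in range(num_procs)]
            for _ in range(num_stages)]

    def neighbors(i):
        return [l for l in (i - 1, i + 1) if 0 <= l < num_procs]

    def worker(i):
        for m in range(num_stages):
            x[m][i] = (m + 1) * (i + 1)   # stage (1): stand-in for real work
            done[m][i].set()              # "announce" that this section is ready
            for l in neighbors(i):        # "check" the neighbors, waiting if needed
                done[m][l].wait()
            # stage (2): needs only this section and the neighboring sections
            r[m][i] = x[m][i] + sum(x[m][l] for l in neighbors(i))

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_procs)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return r

if __name__ == "__main__":
    print(run_neighbor_sync(4, 3))
```

No processor ever waits on more than its two neighbors, so the cost of the announce and check steps does not depend on the total number of processors.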

Actually,  a  processor  need  not  wait  for  its  neighboring  processor(s)  to  complete  their 
entire sections of $x^m$. It need only wait for the points to which its section is coupled, i.e.,
the boundary points of neighboring processors, to be computed. If the work of each
processor can be ordered so that boundary values of $x^m$ are computed first, this might
result  in  less  waiting  time  at  the  synchronization  points.  This  will  be  referred  to  as 
boundary  synchronization. 

Synchronization  could  be  carried  to  even  higher  levels  of  refinement  -  all  the  way  to  a 
dataflow type of parallelism, in which each element of $r^m$ could be computed as soon as the
needed elements of $x^m$ were available. This might result in even less waiting time, but the
execution  time  for  such  synchronization  -  announcing  and  checking  on  the  completion  of 
each  element  -  should  be  much  greater  than  that  of  neighbor  or  boundary  synchronization. 

In  the  following  sections,  barrier,  neighbor,  and  boundary  synchronization  will  be 
analyzed  with  regard  to  execution  time  and  waiting  time,  and  this  will  be  followed  by  a 
discussion  of  whether  it  is  worthwhile  to  use  more  refined  levels  of  synchronization.  The 
time required to complete a program using these forms of synchronization will be
compared with that required to perform the same operations with no synchronization. Only
if  these  times  differ  significantly  would  it  seem  profitable  to  pursue  the  possibility  of  an 
asynchronous  version  of  the  algorithm. 

2.  Notation 

Assume  that  there  are  P  processors  and  Q  points  (referred  to  as  "syncs")  within  a 
program  at  which  some  sort  of  synchronization  is  required  to  implement  the  desired 
algorithm.  An  asynchronous  version  of  the  algorithm  can  be  implemented  without 
synchronizing  at  these  points,  but  it  is  still  required  that  each  processor  perform 
sequentially  the  stages  of  work  delimited  by  these  points. 

For $i \in \{1,\ldots,P\}$, $j \in \{1,\ldots,Q\}$, and $S_i^j$ some subset of the processors, let
$t_i^j$, $\tau_i^j$, $t_a(i,j)$, $t_c(i,j,S_i^j)$, and $T_i^j$ be defined as follows:

$t_i^j$ =

time between processor i leaving sync j - 1 and arriving at sync j (sync 0 is start of
problem).

$\tau_i^j$ =

time between processor i leaving sync j - 1 and computing all data (boundary points)
needed by other processors to proceed past sync j.

$t_a(i,j)$ =

time for processor i to "announce" that it has arrived at sync j, or that it has computed
all data needed by other processors to proceed past sync j.

$t_c(i,j,S_i^j)$ =

time for processor i to "check" that all processors in set $S_i^j$ have arrived at sync j, or
have computed all data needed by other processors to proceed past sync j.

$T_i^j$ =

total time from start of problem until processor i leaves sync j.

The total time for the problem to complete is then given by

\[ \max_{i=1,\ldots,P} T_i^Q . \]
3.   Analysis

In  estimating  the  time  spent  in  synchronizing  processors,  we  will  consider  the 
execution  time  (announcing  and  checking  time)  and  the  waiting  time  separately.  An  upper 
bound  for  the  time  spent  synchronizing  is  given  by  the  sum  of  these  two  terms  and  a  lower 
bound  by  the  maximum  of  these  two  terms.  (This  lower  bound  would  be  met  if  the 
announcing  and  checking  on  arrival  of  processors  could  all  be  carried  out  while  waiting  on 
data,  or  if  no  waiting  were  necessary.) 

Using  the  previously-defined  notation,  then,  consider  first  the  case  of  no 
synchronization.  In  this  case,  the  total  time  for  processor  i  to  pass  the  point  in  the  program 
referred  to  as  sync  j  is  given  by 

\[ T_i^j = \sum_{k=1}^{j} t_i^k . \]

The  total  time  to  complete  the  program  is  the  total  time  for  the  slowest  processor  to  pass 
sync  Q: 

\[ \text{Total Time} = \max_{i=1,\ldots,P} \sum_{j=1}^{Q} t_i^j . \qquad (1) \]

Now  consider  the  case  in  which  all  synchronization  points  are  barriers.  In  this  case, 
each  processor  must  announce  its  arrival  at  each  sync  point,  check  on  the  arrival  of  all 
other  processors,  and  wait  at  each  sync  point  for  all  other  processors  to  arrive.  The  time 
for  processor  i  to  pass  sync  j  can  be  estimated  as  follows:  First  consider  the  time  for 
processor  i  to  pass  sync  1.  This  time  is  less  than  or  equal  to  the  maximum  time  for  any 
processor  to  reach  sync  1  and  announce  that  it  has  reached  sync  1,  plus  the  time  for 
processor  i  to  check  that  all  processors  have  arrived  at  sync  1: 


\[ T_i^1 \le \max_{\ell=1,\ldots,P} \bigl( t_\ell^1 + t_a(\ell,1) \bigr) + t_c(i,1,P) , \]


where P is the set of all processors $\{1,\ldots,P\}$. The time for processor i to pass sync 1 is
greater  than  or  equal  to  the  maximum  time  for  any  processor  to  reach  sync  1  and  announce 
that  it  has  reached  sync  1,  since  processor  i  must  wait  at  sync  1  for  all  other  processors.  It 
is  also  greater  than  or  equal  to  the  time  for  processor  i  to  reach  sync  1,  announce  that  it 
has  reached  sync  1,  and  check  that  all  other  processors  have  reached  sync  1,  since  this  is  all 
work  that  processor  i  must  perform.    Thus,  we  have 
\[ T_i^1 \ge \max\Bigl\{ \max_{\ell=1,\ldots,P} \bigl( t_\ell^1 + t_a(\ell,1) \bigr) ,\; t_i^1 + t_a(i,1) + t_c(i,1,P) \Bigr\} . \]

By  a  similar  argument,  the  time  for  processor  i  to  pass  sync  2  satisfies 

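\[
\max_{\ell=1,\ldots,P} \bigl( T_\ell^1 + t_\ell^2 + t_a(\ell,2) \bigr) + t_c(i,2,P) \;\ge\; T_i^2 \;\ge\;
\max\Bigl\{ \max_{\ell=1,\ldots,P} \bigl( T_\ell^1 + t_\ell^2 + t_a(\ell,2) \bigr) ,\; T_i^1 + t_i^2 + t_a(i,2) + t_c(i,2,P) \Bigr\} .
\]

Substituting the bounds on $T_\ell^1$ and $T_i^1$, we find

\[
\sum_{k=1}^{2} \max_{\ell=1,\ldots,P} \bigl( t_\ell^k + t_a(\ell,k) \bigr) + \max_{\ell=1,\ldots,P} t_c(\ell,1,P) + t_c(i,2,P) \;\ge\; T_i^2 \;\ge\;
\max\Bigl\{ \sum_{k=1}^{2} \max_{\ell=1,\ldots,P} \bigl( t_\ell^k + t_a(\ell,k) \bigr) ,\; \sum_{k=1}^{2} \bigl( t_i^k + t_a(i,k) + t_c(i,k,P) \bigr) \Bigr\} .
\]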

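Continuing in this way, it can be seen in general that $T_i^j$ satisfies

\[
\sum_{k=1}^{j} \max_{\ell=1,\ldots,P} \bigl( t_\ell^k + t_a(\ell,k) \bigr) + \sum_{k=1}^{j-1} \max_{\ell=1,\ldots,P} t_c(\ell,k,P) + t_c(i,j,P) \;\ge\; T_i^j \;\ge\;
\max\Bigl\{ \sum_{k=1}^{j} \max_{\ell=1,\ldots,P} \bigl( t_\ell^k + t_a(\ell,k) \bigr) ,\; \sum_{k=1}^{j} \bigl( t_i^k + t_a(i,k) + t_c(i,k,P) \bigr) \Bigr\} .
\]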

If we assume that the announcing and checking time is bounded, independent of the
processor performing it and the point at which it is performed,

\[ t_a(\ell,k) \le t_a , \qquad t_c(\ell,k,P) \le t_c(P) \qquad \forall\, \ell, k , \]

then the bounds on $T_i^j$ can be simplified to

\[
\sum_{k=1}^{j} \max_{\ell=1,\ldots,P} t_\ell^k + j \, \bigl( t_a + t_c(P) \bigr) \;\ge\; T_i^j \;\ge\;
\max\Bigl\{ \sum_{k=1}^{j} \max_{\ell=1,\ldots,P} t_\ell^k ,\; \sum_{k=1}^{j} \bigl( t_i^k + t_a(i,k) + t_c(i,k,P) \bigr) \Bigr\} .
\]

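From this it follows that the total time to complete the program satisfies

\[
\sum_{j=1}^{Q} \max_{i=1,\ldots,P} t_i^j + Q \, \bigl( t_a + t_c(P) \bigr) \;\ge\; \text{Total Time} \;\ge\;
\max\Bigl\{ \sum_{j=1}^{Q} \max_{i=1,\ldots,P} t_i^j ,\; \max_{i=1,\ldots,P} \sum_{j=1}^{Q} \bigl( t_i^j + t_a(i,j) + t_c(i,j,P) \bigr) \Bigr\} . \qquad (2)
\]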

Note  that  expression  (1)  and  the  upper  bound  in  expression  (2)  differ  in  two  ways. 
First,  expression  (1)  is  a  maximal  sum  of  times,  while  expression  (2)  contains  a  sum  of 
maximal  times.  The  difference  represents  the  waiting  time  at  barriers.  Second,  expression 
(2)  contains  an  additional  term  representing  the  time  to  execute  the  barrier.  The  lower 
bound  in  expression  (2)  shows  essentially  that  if  either  term  in  the  upper  bound  is  large, 
then  a  large  amount  of  time  will,  indeed,  be  spent  in  the  barriers. 
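
The gap between a maximal sum of times and a sum of maximal times is easy to see on a small table of work times. The sketch below is illustrative only: the function names and the sample matrix are assumptions, and announcing and checking time is taken to be zero.

```python
def time_no_sync(t):
    """Expression (1): maximal sum; t[i][j] is the work time of processor i in stage j."""
    return max(sum(row) for row in t)

def time_barrier_wait_only(t):
    """Sum of maximal times from expression (2), with t_a = t_c = 0."""
    num_stages = len(t[0])
    return sum(max(row[j] for row in t) for j in range(num_stages))

if __name__ == "__main__":
    # Hypothetical work times: a different processor is slowest in each stage.
    t = [[1.0, 3.0, 1.0],
         [3.0, 1.0, 1.0],
         [1.0, 1.0, 3.0]]
    print(time_no_sync(t), time_barrier_wait_only(t))
```

In this example every processor needs 5 time units of work, but with barriers the program takes 9, because each stage lasts as long as its slowest processor.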

It is shown in [1] that a barrier can be executed in time $O(\log_2 P)$. It can also be
shown that this is the optimal order of time for executing a barrier. Hence, when large
numbers  of  processors  are  being  used,  the  mere  execution  time  of  a  barrier  can  become 
prohibitively  high.  One  cannot  expect  constant  speedup  on  a  fixed  size  problem  as  the 
number  of  processors  P  increases,  and  one  cannot  even  expect  that  if  the  problem  size 
grows  with  P,  so  that  the  work  per  processor  remains  fixed,  the  computation  time  will 
remain  fixed.  In  both  of  these  cases,  the  time  for  simply  executing  the  barrier  will 
eventually  outweigh  the  actual  work  time.  Thus,  the  execution  time  of  a  barrier  can 
present  a  serious  problem. 

On  the  other  hand,  if  barriers  can  be  replaced  by  neighbor  synchronization,  then  this 
problem  can  be  alleviated  to  a  large  extent.  Evaluating  the  time  for  processor  i  to  pass 
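sync j if all syncs are neighbor syncs, one finds, in the same manner as was done for
barrier syncs, that $T_i^j$ satisfies

\[
\max_{l_j \in S_i^j} \max_{l_{j-1} \in S_{l_j}^{j-1}} \cdots \max_{l_1 \in S_{l_2}^{1}} \sum_{k=1}^{j} \bigl( t_{l_k}^k + t_a(l_k,k) \bigr)
+ \max_{l_j \in S_i^j} \cdots \max_{l_2 \in S_{l_3}^{2}} \sum_{k=1}^{j-1} t_c(l_{k+1},k,S_{l_{k+1}}^k) + t_c(i,j,S_i^j)
\]
\[
\ge\; T_i^j \;\ge\; \max\Bigl\{ \max_{l_j \in S_i^j} \max_{l_{j-1} \in S_{l_j}^{j-1}} \cdots \max_{l_1 \in S_{l_2}^{1}} \sum_{k=1}^{j} \bigl( t_{l_k}^k + t_a(l_k,k) \bigr) ,\;
\sum_{k=1}^{j} \bigl( t_i^k + t_a(i,k) + t_c(i,k,S_i^k) \bigr) \Bigr\} ,
\]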

where each set $S_i^j$ consists of processor i and its neighbors. In general, there will be a fixed
number of "neighbor" processors, independent of the total number of processors being
used. In the setup shown in Fig. 1, a processor has at most two neighbors. If we assume,
again, that the announcing and checking time is bounded by a time depending only on the
number of processors being announced and checked on, then the bounds on $T_i^j$ can be
written as

\[
\max_{l_j \in S_i^j} \max_{l_{j-1} \in S_{l_j}^{j-1}} \cdots \max_{l_1 \in S_{l_2}^{1}} \sum_{k=1}^{j} t_{l_k}^k + j \, \bigl( t_a + t_c(\eta) \bigr) \;\ge\; T_i^j \;\ge\;
\max\Bigl\{ \max_{l_j \in S_i^j} \max_{l_{j-1} \in S_{l_j}^{j-1}} \cdots \max_{l_1 \in S_{l_2}^{1}} \sum_{k=1}^{j} t_{l_k}^k ,\;
\sum_{k=1}^{j} \bigl( t_i^k + t_a(i,k) + t_c(i,k,S_i^k) \bigr) \Bigr\} ,
\]

where $\eta$ is the maximum number of neighbors of any processor. The total time for the
program to complete satisfies

\[
\max_{i=1,\ldots,P} \max_{l_Q \in S_i^Q} \max_{l_{Q-1} \in S_{l_Q}^{Q-1}} \cdots \max_{l_1 \in S_{l_2}^{1}} \sum_{j=1}^{Q} t_{l_j}^j + Q \, \bigl( t_a + t_c(\eta) \bigr) \;\ge\; \text{Total Time} \qquad (3)
\]
\[
\text{Total Time} \;\ge\; \max\Bigl\{ \max_{i=1,\ldots,P} \max_{l_Q \in S_i^Q} \max_{l_{Q-1} \in S_{l_Q}^{Q-1}} \cdots \max_{l_1 \in S_{l_2}^{1}} \sum_{j=1}^{Q} t_{l_j}^j ,\;
\max_{i=1,\ldots,P} \sum_{j=1}^{Q} \bigl( t_i^j + t_a(i,j) + t_c(i,j,S_i^j) \bigr) \Bigr\} .
\]

The summation of processor times appearing in (3) is greater than or equal to the
maximal sum of times in (1) but less than or equal to the sum of maximal times in (2).
This  reflects  the  fact  that  while  some  time  will  be  spent  waiting  at  neighbor  synchronization 
points,  it  will  generally  be  less  than  that  spent  at  barriers. 
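
The nested maxima in (3) need not be enumerated directly; they can be evaluated stage by stage, since a processor leaves a neighbor sync once it and its neighbors have all arrived. The following is a minimal sketch of that recurrence with announcing and checking time set to zero. The function names and the chain-shaped neighbor lists are assumptions made for illustration.

```python
def chain_neighbors(num_procs):
    """Neighbor lists for processors arranged in a line (at most two neighbors each)."""
    return [[l for l in (i - 1, i + 1) if 0 <= l < num_procs]
            for i in range(num_procs)]

def time_neighbor_wait_only(t, neighbors):
    """Completion time with neighbor syncs, taking t_a = t_c = 0.

    t[i][j] is the work time of processor i in stage j; neighbors[i] lists
    the processors coupled to processor i (i itself is always included).
    """
    num_procs, num_stages = len(t), len(t[0])
    leave = [0.0] * num_procs          # time each processor left the previous sync
    for j in range(num_stages):
        arrive = [leave[i] + t[i][j] for i in range(num_procs)]
        leave = [max(arrive[l] for l in [i] + list(neighbors[i]))
                 for i in range(num_procs)]
    return max(leave)

if __name__ == "__main__":
    # Hypothetical times: the slow stage sits at opposite ends of the chain.
    t = [[3.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 3.0]]
    print(time_neighbor_wait_only(t, chain_neighbors(4)))
```

For these sample times the neighbor-synchronized program finishes in 4 units, the same as with no synchronization, whereas barriers would take 3 + 3 = 6, because the delay at one end of the chain does not reach the other end in time to matter.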

The  other  difference  between  expressions  (2)  and  (3)  lies  in  the  announcing  and 
checking time. While the execution time for a barrier is $O(\log P)$, the execution time for
neighbor synchronization is $O(1)$. Thus, with neighbor synchronization, if the amount of
work  per  processor  remains  fixed  and  large  enough  to  make  the  announcing  and  checking 
time  negligible,  then,  in  a  given  time,  one  could  expect  to  perform  approximately  P  times 
as  much  work  on  P  processors  as  on  one,  provided  the  amount  of  waiting  time  at  the 
synchronization  points  is  small. 

Of  course,  if  a  fixed  amount  of  work  is  divided  among  more  and  more  processors, 
then  eventually  the  execution  time  of  a  neighbor  sync  will  outweigh  the  work  time  of  each 
processor.  If  the  synchronization  routines  can  be  made  very  efficient,  however,  then  this 
point  should  not  occur  until  each  processor  has  only  a  few  operations  to  perform  between 
sync  points.  In  this  case,  the  work  cannot  be  divided  much  further,  anyway. 
Asynchronous  algorithms  offer  a  small  advantage  here,  in  that  one  could  conceivably  go  to 
one  operation  per  processor  and  still  obtain  linear  speedup  with  an  asynchronous 
algorithm. 

The  situation  with  execution  of  synchronization  instructions  can  be  summed  up  as 
follows:  If  barriers  are  essential,  they  will  be  a  bottleneck  on  machines  with  large  numbers 
of  processors.  If  they  can  be  replaced  by  neighbor  syncs  that  are  efficiently  implemented, 
then  the  problem  is  gone  except  possibly  when  one  is  using  very  small  granularity. 

4.   Analysis  of  Waiting  Time 

The  remainder  of  this  paper  will  be  devoted  to  analyzing  the  waiting  time  at  the 
various types of synchronization points. We will assume that the work per processor,
between sync points, remains fixed as the number of processors increases, and try to
determine  whether  the  time  to  completion  remains  approximately  fixed  and  how  this 
completion  time  compares  with  that  required  to  perform  the  same  operations  with  no 
processor  synchronization. 

Boundary  synchronization  requires  the  same  announcing  and  checking  time  as  neighbor 
synchronization.  It  should,  however,  require  less  waiting  time  at  the  synchronization 
points.  While  an  expression  analogous  to  (1)  -  (3)  can  be  derived  to  give  bounds  on  the 
time  required  when  all  syncs  are  boundary  syncs,  it  is  complicated  and  unenlightening. 
The  best  way  to  understand  the  differences  in  waiting  time  for  the  various  synchronization 
forms is to use a picture, e.g., Fig. 2. The lengths of the various line segments in Fig. 2
represent  the  work  time  required  by  each  processor  between  each  pair  of  synchronization 
points.  The  longest  total  length  is  the  total  time  required  with  no  synchronization,  as 
illustrated  in  Fig.  2a.  If  we  take  the  longest  segments  between  each  pair  of  synchronization 
points  and  add  these  segments  together,  then  we  obtain  the  total  time  required  with  barrier 
synchronization,  as  illustrated  in  Fig.  2b.  Figs.  2c-d  illustrate  the  total  time  required  with 
neighbor  and  boundary  synchronization,  respectively.  The  x's  in  Fig.  2d  represent  the 
points  in  the  computation  at  which  boundary  points  are  completed  by  the  various 
processors.  Note  that  while  there  is  some  waiting  time  with  boundary  synchronization,  it  is 
less  than  that  required  with  neighbor  synchronization.  Again,  Fig.  2  represents  only 
differences  in  waiting  time;  announcing  and  checking  time  is  now  being  ignored. 
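
Although the closed-form bounds for boundary synchronization are omitted, the waiting-time behavior pictured in Fig. 2d can be reproduced with a simple recurrence. The sketch below rests on assumptions that are not spelled out in the text: a processor leaves sync j once its own stage-j work is finished and every neighbor has finished its stage-j boundary points, and announcing and checking time is zero. The names `time_boundary_wait_only` and `tau` are hypothetical.

```python
def time_boundary_wait_only(t, tau, neighbors):
    """Completion time with boundary syncs, taking t_a = t_c = 0.

    t[i][j]   : full work time of processor i in stage j
    tau[i][j] : time for processor i to finish its boundary points in stage j
                (assumed to satisfy tau[i][j] <= t[i][j])
    """
    num_procs, num_stages = len(t), len(t[0])
    leave = [0.0] * num_procs
    for j in range(num_stages):
        own_done = [leave[i] + t[i][j] for i in range(num_procs)]
        boundary_done = [leave[i] + tau[i][j] for i in range(num_procs)]
        leave = [max([own_done[i]] + [boundary_done[l] for l in neighbors[i]])
                 for i in range(num_procs)]
    return max(leave)

if __name__ == "__main__":
    neighbors = [[1], [0]]                 # two processors coupled to each other
    t = [[4.0, 2.0], [2.0, 4.0]]           # hypothetical work times
    tau = [[1.0, 1.0], [1.0, 1.0]]         # boundary points are computed first
    print(time_boundary_wait_only(t, tau, neighbors))   # boundary sync
    print(time_boundary_wait_only(t, t, neighbors))     # tau = t gives neighbor sync
```

With these sample values boundary synchronization finishes in 6 units, equal to the time with no synchronization, while setting `tau` equal to `t` recovers neighbor synchronization and gives 8.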

The  amount  of  time  spent  waiting  at  synchronization  points  depends  on  the  variation 
in  times  the  processors  spend  working  between  sync  points.  If  the  work  times  between 
sync  points  for  the  various  processors  differ  by  only  a  small  relative  amount,  then  it  can  be 
shown  that  the  total  time  to  completion  (ignoring  announcing  and  checking  time)  using 
barriers  differs  by  only  a  small  relative  amount  from  that  using  no  synchronization.  To  see 
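this, assume that the processor work times $t_i^j$ satisfy

\[
\max_{i=1,\ldots,P} t_i^j \le (1 + \epsilon)\, t_{\mathrm{avg}}^j , \qquad
t_{\mathrm{avg}}^j = \frac{1}{P} \sum_{i=1}^{P} t_i^j , \qquad j = 1,\ldots,Q , \qquad (4)
\]

for some small number $\epsilon$. Then the total time to completion with barrier syncs satisfies

\[
\sum_{j=1}^{Q} \max_{i=1,\ldots,P} t_i^j
\;\le\; \sum_{j=1}^{Q} (1 + \epsilon) \frac{1}{P} \sum_{i=1}^{P} t_i^j
\;=\; (1 + \epsilon) \frac{1}{P} \sum_{i=1}^{P} \sum_{j=1}^{Q} t_i^j
\;\le\; (1 + \epsilon) \max_{i=1,\ldots,P} \sum_{j=1}^{Q} t_i^j . \qquad (5)
\]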

Thus,  there  is  little  difference  between  the  total  time  with  barriers  (the  sum  of  maximal 
processor  times)  and  that  with  no  synchronization  (the  maximal  sum  of  times).  The  total 
time  with  neighbor  and  boundary  synchronization  lies  between  these  two,  and  so  there  is 
little  difference  here  either. 
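
The bound in (5) can be checked numerically on a small example. This is only a sanity check under assumed data: the helper name `relative_spread` and the sample work times are hypothetical.

```python
def relative_spread(t):
    """Smallest eps for which (4) holds: max_i t[i][j] <= (1 + eps) * mean_i t[i][j]."""
    num_procs, num_stages = len(t), len(t[0])
    eps = 0.0
    for j in range(num_stages):
        column = [t[i][j] for i in range(num_procs)]
        average = sum(column) / num_procs
        eps = max(eps, max(column) / average - 1.0)
    return eps

def sum_of_max(t):
    return sum(max(row[j] for row in t) for j in range(len(t[0])))

def max_of_sum(t):
    return max(sum(row) for row in t)

if __name__ == "__main__":
    # Hypothetical work times that vary by about 5% around their stage averages.
    t = [[1.00, 1.05, 1.00],
         [1.05, 0.95, 1.00],
         [0.95, 1.00, 1.00]]
    eps = relative_spread(t)
    print(eps, sum_of_max(t), (1.0 + eps) * max_of_sum(t))
```

Here the time with barriers (about 3.10) stays below $(1 + \epsilon)$ times the time with no synchronization (about 3.20), as (5) requires.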

It  should  also  be  noted  that  if  the  processor  work  times  differ  by  a  large  amount  but 
the  maximum  time  occurs  consistently  on  one  particular  processor,  then  although 
performance  will  not  be  optimal  due  to  poor  load  balancing,  the  time  required  with  barrier 
synchronization  will  be  exactly  the  same  as  that  required  with  no  synchronization.  That  is, 
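if a processor time $t_{i^*}^j$ satisfies

\[ t_{i^*}^j = \max_{i=1,\ldots,P} t_i^j , \qquad j = 1,\ldots,Q , \qquad (6) \]

then the sum of maximal processor times is equal to the maximal sum of processor times:

\[
\sum_{j=1}^{Q} \max_{i=1,\ldots,P} t_i^j = \sum_{j=1}^{Q} t_{i^*}^j = \max_{i=1,\ldots,P} \sum_{j=1}^{Q} t_i^j . \qquad (7)
\]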

Similarly,  of  course,  the  time  required  with  neighbor  and  boundary  synchronization  will 
also  be  the  same. 

While  the  relations  between  processor  times  described  above  are  quite  typical,  they 
are  not  the  only  possibilities.  From  the  above  arguments  it  follows  that  one  can  expect  a 
significant  difference  between  the  time  required  with  barriers  and  that  required  with  no 
synchronization only if (1) the processor times between a pair of sync points differ by a large
relative  amount  and  (2)  the  maximum  processor  time  occurs  on  different  processors  at 
different  stages  in  the  program. 

To explore some other possibilities, let us consider the processor times
$t_i^j$, $i = 1,\ldots,P$, $j = 1,\ldots,Q$, as random variables. We will consider several possible
distributions for these random variables and ask how the total time to completion compares
for the various forms of synchronization.

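First note that the expected time $E_{\mathrm{barr}}$ with barrier synchronization and the expected
time $E_{\mathrm{none}}$ with no synchronization satisfy

\[
E_{\mathrm{barr}} = E\Bigl( \sum_{j=1}^{Q} \max_{i=1,\ldots,P} t_i^j \Bigr) = \sum_{j=1}^{Q} E\Bigl( \max_{i=1,\ldots,P} t_i^j \Bigr) , \qquad (8)
\]
\[
E_{\mathrm{none}} = E\Bigl( \max_{i=1,\ldots,P} \sum_{j=1}^{Q} t_i^j \Bigr) \ge \sum_{j=1}^{Q} E( t_i^j ) .
\]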

As  the  number  of  processors  P  becomes  large,  it  becomes  likely,  at  each  stage,  that  some 
processor  will  require  almost  the  maximum  possible  time.  If  the  number  of 
synchronization  points  Q  is  fixed,  it  also  becomes  likely  (though  much  more  slowly)  that 
some  processor  will  require  almost  the  maximum  possible  time  at  all  of  the  Q  stages. 
Thus,  in  this  case,  as  the  number  of  processors  becomes  large  we  have 

\[
\lim_{P \to \infty} E_{\mathrm{barr}} = \lim_{P \to \infty} E_{\mathrm{none}} = \sum_{j=1}^{Q} t_{\max}^j , \qquad (9)
\]

where $t_{\max}^j$ is the maximum possible value for the $t_i^j$'s, or infinity if the $t_i^j$'s are unbounded.

For the type of problems mentioned in the introduction, e.g., iterative methods for
solving sparse linear systems, it is generally the case that the number of synchronization
points Q will grow as the problem size grows. In solving Laplace's equation, for example,
the number of iterations, and hence the number of synchronization points, with Jacobi's
method grows as N. With the conjugate gradient method, it grows as $\sqrt{N}$. With
multigrid methods, while the number of multigrid cycles remains fixed, the number of grid
levels, and hence the number of synchronization points, grows as log N. Since we are
assuming  that  the  problem  size  grows  with  P  so  that  the  work  per  processor  between 
synchronization  points  remains  fixed,  it  is  also  reasonable  to  assume  that  the  number  of 
synchronization  points  grows. 

Kruskal and Weiss [4] showed that if the $t_i^j$'s are independent identically distributed
random variables with mean $\mu$ and variance $\sigma^2$, and if Q is large with respect to log P, then
the expected value of the maximal sum satisfies

\[
E\Bigl( \max_{i=1,\ldots,P} \sum_{j=1}^{Q} t_i^j \Bigr) \approx Q \Bigl( \mu + \sigma \sqrt{2 (\log P) / Q} \Bigr) . \qquad (10)
\]

Thus,  the  inequality  in  (8)  is  approximately  an  equality.  In  this  case,  then,  there  will  be  a 
large  difference  in  completion  time  using  barrier  synchronization  versus  no  synchronization 
if  the  expected  value  of  the  maximal  processor  time  is  much  greater  than  the  mean 
processor time:

\[ E\Bigl( \max_{i=1,\ldots,P} t_i^j \Bigr) \gg \mu . \]

(The assumption that the $t_i^j$'s are identically distributed implies that the maximal processor
time is as likely to occur on one processor as another.)
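
The size of the correction term in (10) is easy to evaluate. The sketch below assumes that the logarithm in (10) is the natural logarithm, which is the form that arises for maxima of approximately normal sums; the function name and the sample parameters are illustrative only.

```python
import math

def kruskal_weiss_estimate(num_procs, num_syncs, mean, std):
    """Approximation (10) for the expected maximal sum (natural log assumed)."""
    correction = std * math.sqrt(2.0 * math.log(num_procs) / num_syncs)
    return num_syncs * (mean + correction)

if __name__ == "__main__":
    # Assumed sample values: P = 1024, Q = 1000, mean 1, standard deviation 0.1.
    estimate = kruskal_weiss_estimate(1024, 1000, 1.0, 0.1)
    print(estimate / 1000)   # expected time per stage, relative to the mean of 1
```

For these values the estimate exceeds $Q\mu$ by only a little over 1%, which illustrates why the inequality in (8) is nearly an equality when Q is large compared with log P.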

Suppose,  now,  that  work  is  divided  evenly  among  the  processors  but  that  the  time  to 
execute  an  instruction  may  vary  due  to,  say,  memory  bank  conflicts.  If  the  instruction 
times  are  considered  as  random  variables  and  if  the  processor  times  are  sums  of  large 
numbers  of  instruction  times,  then,  by  the  central  limit  theorem,  the  processor  times  should 
be  approximately  normally  distributed.  Of  course,  it  must  be  a  truncated  normal 
distribution,  with  some  positive  minimum  value  based  on  the  minimum  instruction  time  and 
some  finite  maximum  value  based  on  a  worst-case  scenario  for  bank  conflicts.  Since  these 
minimum  and  maximum  values  are  usually  unknown,  however,  we  will  assume  a  normal 
distribution  truncated  only  on  the  lower  end  at  zero.  Fig.  3  shows  the  expected  total  time 
to  completion  using  the  various  synchronization  forms  on  from  1  to  1024  processors,  when 
the  processor  times  between  sync  points  are  normally  distributed  with  mean  1  and  standard 
deviation  .1.  In  computing  the  time  with  boundary  synchronization,  it  was  assumed  that 
boundary  points  could  be  computed  in  one  tenth  the  time  required  for  a  processor  to 
complete  its  entire  section.  Again,  the  curves  in  Figs.  3-5  are  based  only  on  waiting  time  at 
synchronization  points  due  to  variation  in  processor  times;  announcing  and  checking  time  is 
assumed  to  be  zero. 
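
Curves of this kind can be approximated by a small Monte Carlo experiment. The sketch below is not the program used to produce the figures; it assumes independent work times, processors arranged in a chain with at most two neighbors each, boundary points finished after a fixed fraction of each stage, and zero announcing and checking time. All names are hypothetical.

```python
import random

def truncated_normal(rng, mean=1.0, std=0.1):
    """Normal work time, truncated on the lower end at zero by rejection."""
    while True:
        value = rng.gauss(mean, std)
        if value > 0.0:
            return value

def simulate(num_procs, num_syncs, draw, boundary_fraction=0.1, trials=20, seed=0):
    """Average completion time for each synchronization form (waiting time only)."""
    rng = random.Random(seed)
    nbrs = [[l for l in (i - 1, i + 1) if 0 <= l < num_procs]
            for i in range(num_procs)]
    totals = {"none": 0.0, "boundary": 0.0, "neighbor": 0.0, "barrier": 0.0}
    procs = range(num_procs)
    for _ in range(trials):
        none = [0.0] * num_procs   # running sums with no synchronization
        bnd = [0.0] * num_procs    # leave times with boundary syncs
        nbr = [0.0] * num_procs    # leave times with neighbor syncs
        bar = 0.0                  # leave time with barriers (same for everyone)
        for _ in range(num_syncs):
            t = [draw(rng) for _ in procs]
            none = [none[i] + t[i] for i in procs]
            bar += max(t)
            arrive = [nbr[i] + t[i] for i in procs]
            nbr = [max([arrive[i]] + [arrive[l] for l in nbrs[i]]) for i in procs]
            own = [bnd[i] + t[i] for i in procs]
            edge = [bnd[i] + boundary_fraction * t[i] for i in procs]
            bnd = [max([own[i]] + [edge[l] for l in nbrs[i]]) for i in procs]
        totals["none"] += max(none)
        totals["boundary"] += max(bnd)
        totals["neighbor"] += max(nbr)
        totals["barrier"] += bar
    return {name: total / trials for name, total in totals.items()}

if __name__ == "__main__":
    for name, value in simulate(64, 100, truncated_normal).items():
        print(name, round(value, 2))
```

Because the same work times are used for all four forms within a trial, the results are always ordered as none, boundary, neighbor, barrier, from smallest to largest.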

Note  from  Fig.  3  that  with  1024  processors,  the  total  time  with  barrier  synchronization 
differs  from  that  with  no  synchronization  by  about  30  -  35%.  The  difference  between 
neighbor  synchronization  and  no  synchronization  is  considerably  less  -  about  10%,  and  with 
boundary  synchronization  the  two  curves  are  almost  identical.  Differences  would  be 
greater  if  the  standard  deviation  in  processor  times  were  larger,  but  a  10%  standard 
deviation is probably not an unreasonably low estimate for most systems (in fact, it is
probably on the high side). If processor times are distributed in this way, then, and if
neighbor or boundary synchronization can be used so that execution time of the
synchronization instructions is small, then there would seem to be little reason for trying to
develop an asynchronous version of the algorithm or for considering more refined
synchronization techniques.

Ultracomputer Note 98  Page 12

On  the  other  hand,  even  when  the  work  is  divided  evenly  among  processors,  there  are 
sometimes  other  phenomena,  besides  bank  conflicts,  that  affect  the  timing  of  processors. 
Some  multiprocessors  (e.g.,  the  hypercube,  the  VAX  11/780-4,  etc.)  have  separate 
operating  systems  on  each  node.  These  occasionally  and  randomly  interfere  with  the 
execution  of  a  code  to  perform  various  operating  system  functions.  Other  multiprocessors 
(e.g.,  the  Balance  Sequent  8000)  have  system  daemons  that  perform  similar  sorts  of 
functions.  If  the  granularity  of  a  multi-processed  code  is  large,  then  this  interference 
results  in  just  a  small  relative  perturbation  of  the  processor  times.  In  this  case,  as  was 
shown  earlier,  the  total  time  with  any  of  the  synchronization  forms  varies  by  just  a  small 
fraction  from  that  with  no  synchronization.  If  the  granularity  is  small,  however,  then  the 
perturbations  produced  by  such  interference  may  be  large,  on  a  relative  scale. 

A  simplified  model  of  this  situation  might  have  processor  times  distributed  as  follows: 

\[
t_i^j = \begin{cases}
1 , & \text{with probability } p \text{ (no interference)} \\
1 + s , & \text{with probability } 1 - p \text{ (interference)}
\end{cases}
\]

Taking  p  to  be  .99  and  s  to  be  10,  one  obtains  the  results  shown  in  Fig.  4.  Here  the 
difference  in  the  expected  total  time  with  barrier  synchronization  versus  no  synchronization 
is  quite  significant.  With  1024  processors,  it  becomes  likely,  at  each  stage  of  the  program, 
that  some  processor  is  interrupted.  Hence,  with  barrier  synchronization  all  processors 
must wait, and the total time approaches the maximum possible, 11*Q.

With  no  synchronization  and  Q  >>  log  P,  it  follows  from  the  Kruskal  and  Weiss 
result  (10)  that  the  total  time  approaches  approximately  Q  times  the  mean  time  between 
sync  points,  or,  1.1*Q.  This  is  also  seen  in  Fig.  4.  The  difference  with  neighbor  and 
boundary  synchronization  is  also  significant  but  considerably  smaller  than  that  with 
barriers.  With  1024  processors  one  loses  roughly  a  factor  of  1.3  —  2.3  in  speed  due  to 
waiting  at  neighbor  or  boundary  syncs.  Thus,  in  this  case,  there  might  be  some  reason  for 
considering synchronization techniques that require less waiting or for pursuing
asynchronous methods. Still, even in this rather extreme example, in which the
interruption  time  is  ten  times  as  large  as  the  usual  processor  work  time,  the  gains  to  be  had 
are  not  tremendous.  It  might  be  more  profitable  to  use  a  multiprocessor  with  only  one 
operating  system  so  that  all  processors  are  interrupted  simultaneously,  or  to  simply  run 
problems  with  larger  granularity  to  drown  out  the  interference.  If  the  relative  interruption 
time  s  were  equal  to  1,  say,  instead  of  10,  then  the  curves  in  Fig.  4  would  look  qualitatively 
the  same,  but  with  the  vertical  axis  going  up  to  2  instead  of  11. 
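
The two limiting values quoted for the interference model, 1.1*Q and 11*Q, follow from a short calculation. The sketch below assumes that interruptions strike processors independently; the function name is hypothetical.

```python
def interference_model(num_procs, p=0.99, s=10.0):
    """Per-stage statistics for the two-valued interference model.

    Returns (mean work time, probability that at least one processor is
    interrupted in a stage, expected stage length with a barrier).
    Interruptions are assumed independent across processors.
    """
    mean_time = p * 1.0 + (1.0 - p) * (1.0 + s)
    prob_some_interrupted = 1.0 - p ** num_procs
    expected_stage_max = 1.0 + s * prob_some_interrupted
    return mean_time, prob_some_interrupted, expected_stage_max

if __name__ == "__main__":
    for num_procs in (1, 32, 1024):
        print(num_procs, interference_model(num_procs))
```

With p = .99 and s = 10 the mean work time per stage is 1.1, while with 1024 processors the chance that some processor is interrupted in a given stage is so close to 1 that the expected stage length with a barrier is nearly 11.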

The  above  models  assume  that  work  is  divided  evenly  among  processors  and  attempt 
to  account  for  various  other  influences  on  processor  times.  Another  possibility  is  that  the 
work  is  not  divided  evenly.  If  one  particular  processor  has  more  work  than  the  others 
throughout  all  stages  of  the  program,  then  it  was  shown  earlier  that  the  total  time  using  any 
of  the  various  synchronization  forms  is  the  same  as  that  with  no  synchronization.  If  work 
is  divided  unevenly  and  assigned  to  processors  in  some  random  fashion,  however,  then  one 
might  assume  a  uniform  distribution  of  processor  times. 

Fig.  5  shows  the  expected  total  times  using  the  various  synchronization  forms,  when 
the  processor  times  between  sync  points  are  uniformly  distributed  between  1  and  10.  With 
no  synchronization  and  Q  >>  log  P,  the  total  time  divided  by  Q  approaches  approximately 
the  mean  processor  time  between  sync  points,  5.5.  With  barriers,  it  approaches  the 
maximum  processor  time  between  sync  points,  10.  The  difference  is  less  than  a  factor  of 
2,  and  with  neighbor  or  boundary  synchronization  this  difference  drops  to  less  than  about 
50%. 
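
The uniform case can be explored with a small Monte Carlo sketch. The code below is illustrative and is not the program used to produce Fig. 5. It assumes that, under neighbor synchronization, a processor may begin a stage only after it and its adjacent processors (in a linear array) have finished the previous stage; boundary synchronization is omitted. The names `simulate` and `sample` are hypothetical.

```python
import random


def simulate(num_procs, num_stages, sample, seed=0):
    """Return total time / Q for (no sync, neighbor sync, barrier sync).

    `sample(rng)` draws one processor time between sync points.
    The neighbor rule used here is an assumption (see the text above).
    """
    rng = random.Random(seed)
    none = [0.0] * num_procs    # finish times with no synchronization
    neigh = [0.0] * num_procs   # finish times with neighbor synchronization
    barrier = 0.0               # common finish time with barriers
    for _ in range(num_stages):
        t = [sample(rng) for _ in range(num_procs)]
        none = [none[i] + t[i] for i in range(num_procs)]
        # A processor starts once it and its adjacent processors are done.
        start = [max(neigh[max(i - 1, 0):i + 2]) for i in range(num_procs)]
        neigh = [start[i] + t[i] for i in range(num_procs)]
        # With a barrier, each stage costs the slowest processor's time.
        barrier += max(t)
    return (max(none) / num_stages,
            max(neigh) / num_stages,
            barrier / num_stages)


def uniform_1_10(rng):
    # Processor times uniformly distributed between 1 and 10, as in the text.
    return rng.uniform(1.0, 10.0)


print(simulate(64, 200, uniform_1_10))
```

Because all three totals are computed from the same random draws, the no-synchronization total never exceeds the neighbor total, which in turn never exceeds the barrier total. Increasing `num_stages` moves the first value toward the mean of 5.5, and increasing `num_procs` moves the last toward the maximum of 10.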

5.   Conclusions  and  Further  Remarks 

While  the  execution  time  of  a  barrier  grows  as  the  number  of  processors  becomes 
large,  there  are  other  simple  synchronization  forms  whose  execution  time  remains  fixed 
and  that  can  often  be  used  in  place  of  barriers.  For  most  reasonable  processor  time 
distributions,  the  time  spent  waiting  at  the  synchronization  points  is  modest.  Hence,  as 
long  as  the  granularity  of  a  program  is  large  enough  to  make  the  fixed  execution  time  of 
neighbor  or  boundary  synchronization  small  compared  to  the  work  time  of  each  processor, 
there  would  seem  to  be  little  to  gain  by  going  to  more  complicated  synchronization 
strategies  or  to  asynchronous  algorithms. 

Why,  then,  have  these  ideas  received  so  much  attention?  Reviewing  the  literature  on 
asynchronous  algorithms  [e.g.,  2,3],  the  answer  seems  to  lie  mainly  in  the  fact  that  the 
synchronization  routines  used  have  generally  been  very  slow.  Combining  this  with  the 
relatively  small  granularity  of  test  problems,  one  finds  that  the  time  to  announce  and  check 
on  the  arrival  of  even  one  processor  is  non-negligible  compared  to  the  work  time  of  each 
processor.  In  [2],  for  instance,  it  is  reported  that  in  implementing  Jacobi's  method  on  two 
processors,  29.9%  of  the  time  was  spent  in  synchronization.  Since  the  work  was  divided 
evenly  between  the  processors  and  the  processor  speeds  were  the  same,  almost  all  of  this 
time  must  have  been  spent  in  simply  executing  the  synchronization  instructions,  rather  than 
in  waiting  for  a  processor  to  arrive. 

But  synchronization  instructions  need  not  require  so  much  time  to  execute,  as  is 
discussed  in  [5].  The  time  to  announce  that  a  processor  has  arrived  at  a  sync  point  should 
be  the  time  required  to  set  a  bit  in  shared  memory,  or,  on  a  machine  with  only  local 
memory,  the  time  required  to  set  and  broadcast  this  bit,  along  with  other  data  that  needs  to 
be  broadcast  anyway.  The  time  to  check  on  the  arrival  of  a  processor  should  be  the  time  to 
check  if  this  bit  is  on.  By  "spin-waiting"  on  this  bit,  a  processor  can  proceed  with  its  work 
as  soon  as  the  bit  is  turned  on.  To  avoid  tying  up  processors,  however,  most 
synchronization  routines  do  not  spin-wait.  Instead,  they  go  through  the  operating  system 
to  give  up  the  processor  if  the  other  processor  is  not  yet  ready.  This  causes  a  long  delay, 
often  on  the  order  of  several  thousand  machine  cycles. 

Yet  with  spin-waiting,  and  by  implementing  in-line  synchronization  instructions  to 
avoid  the  cost  of  a  subroutine  call,  one  should  be  able  to  implement  neighbor  or  boundary 
synchronization  in  a  time  comparable  to  that  required  to  perform  a  few  arithmetic 
operations.  One  could  then  use  fairly  small  granularity  (several  operations  per  processor 
between  sync  points)  and  still  expect  good  performance.  Only  for  the  smallest  granularity 
levels  (one  to  a  few  operations  per  processor)  would  it  be  necessary  to  go  to  asynchronous 
methods. 
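
The announce-and-check scheme described above can be sketched with threads standing in for processors. This is only an illustration under stated assumptions: a per-processor stage counter replaces the single bit so that the same shared location can be reused across stages, the processors form a linear array, and the name `run_neighbor_sync` is hypothetical. Python threads do not reproduce the machine-level timings discussed in the text; the sketch shows only the logic.

```python
import threading


def run_neighbor_sync(num_procs=4, num_stages=3):
    """Neighbor synchronization by spin-waiting on shared counters (sketch)."""
    # arrived[i] holds the last stage that processor i has completed.
    arrived = [0] * num_procs
    # observed[i][k] records the neighbors' counters seen after the wait.
    observed = [[] for _ in range(num_procs)]

    def worker(i):
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < num_procs]
        for stage in range(1, num_stages + 1):
            # ... this processor's own work for the stage would go here ...
            arrived[i] = stage                  # announce arrival
            for j in neighbors:
                while arrived[j] < stage:       # spin-wait on the neighbor
                    pass
            observed[i].append([arrived[j] for j in neighbors])

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return arrived, observed


arrived, observed = run_neighbor_sync()
print(arrived)
```

Each processor writes only its own counter and reads only those of its neighbors, so the cost of announcing and checking does not depend on the total number of processors.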